Digging Deeper into Deep Web Databases by Breaking Through the Top-k Barrier

Authors

  • Saravanan Thirumuruganathan
  • Nan Zhang
  • Gautam Das
Abstract

A large number of web databases are only accessible through proprietary form-like interfaces, which require users to query the system by entering desired values for a few attributes. A key restriction enforced by such an interface is the top-k output constraint, i.e., when there are a large number of matching tuples, only a few (top-k) of them are preferentially selected and returned by the website, often according to a proprietary ranking function. Since most web database owners set k to be a small value, the top-k output constraint prevents many interesting third-party (e.g., mashup) services from being developed over real-world web databases. In this paper we consider the novel problem of “digging deeper” into such web databases. Our main contribution is the meta-algorithm GetNext, which can retrieve the next ranked tuple from the hidden web database using only the restrictive interface of a web database, without any prior knowledge of its ranking function. This algorithm can then be called iteratively to retrieve as many top-ranked tuples as necessary. We develop principled and efficient algorithms that are based on generating and executing multiple reformulated queries and inferring the next ranked tuple from their returned results. We provide theoretical analysis of our algorithms, as well as extensive experimental results over synthetic and real-world databases that illustrate the effectiveness of our techniques.
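
The top-k output constraint and the idea of query reformulation can be illustrated with a small simulation. The sketch below is not the GetNext algorithm from the paper; it is a toy model under stated assumptions. The class `TopKInterface`, the attributes `make` and `color`, the scoring function, and the overflow flag are all hypothetical and chosen only for illustration. It shows that a broad query returns only k tuples, while narrower reformulated queries expose tuples that were hidden below the cutoff.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]


class TopKInterface:
    """Toy stand-in for a hidden web database behind a form interface.

    Assumption: the interface accepts conjunctive equality predicates and
    reports whether more than k tuples matched (an overflow flag).
    """

    def __init__(self, rows: List[Row], k: int, score: Callable[[Row], int]):
        # The ranking is fixed and hidden from the client.
        self._rows = sorted(rows, key=score, reverse=True)
        self.k = k
        self.queries_issued = 0

    def query(self, **equals: int) -> Tuple[List[Row], bool]:
        """Return (top-k matching rows in ranked order, overflow flag)."""
        self.queries_issued += 1
        matches = [
            r for r in self._rows
            if all(r[attr] == value for attr, value in equals.items())
        ]
        return matches[: self.k], len(matches) > self.k


def demo() -> Tuple[List[Row], bool, List[int]]:
    # Hypothetical data: 12 tuples with two categorical attributes.
    rows = [{"id": i, "make": i % 3, "color": i % 2} for i in range(12)]
    # Arbitrary hidden ranking; the client code below never reads it.
    db = TopKInterface(rows, k=2, score=lambda r: (r["id"] * 7) % 12)

    # Broad query: four tuples match, but only the top 2 are returned.
    broad, overflow = db.query(make=0)
    seen = {r["id"] for r in broad}

    # Reformulated queries: add one more predicate to narrow the match set.
    for color in (0, 1):
        narrow, _ = db.query(make=0, color=color)
        seen |= {r["id"] for r in narrow}

    return broad, overflow, sorted(seen)


if __name__ == "__main__":
    broad, overflow, seen = demo()
    print([r["id"] for r in broad], overflow, seen)
```

In this toy run the reformulated queries reveal two tuples that the broad query withheld. Deciding which of the newly revealed tuples is ranked next, without knowing the ranking function, is the harder inference problem that the abstract describes; the sketch does not attempt it.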

Similar resources

Retrieving Deep Web Data Based on Heuristic Hierarchy Tree Model

Deep Web data refers to a dataset that users can query through a search interface and that is rendered in dynamically generated web pages, generally topic-based. However, many web database interfaces limit the number k of relevant tuples returned for each query submitted by a user, which is known as the top-k problem. To address this problem, we propose a novel method to prune the hierarchy tree, which aims ...

Automatic Hierarchical Classification of Structured Deep Web Databases

We present a method that automatically classifies structured deep Web databases according to a pre-defined topic hierarchy. We assume that there are some manually classified databases, i.e., training databases, in every node of the topic hierarchy. Each training database is probed using queries constructed from the node titles of the topic hierarchy and the query result counts reported by the d...

KEYRY: A Keyword-Based Search Engine over Relational Databases Based on a Hidden Markov Model

We propose the demonstration of KEYRY, a tool for translating keyword queries over structured data sources into queries in the native language of the data source. KEYRY does not assume any prior knowledge of the source contents. This allows it to be used in situations where traditional keyword search techniques over structured data, which require such knowledge, cannot be applied, i.e., sources ...

Progressive Deep Web Crawling Through Keyword Queries For Data Enrichment

Data enrichment is the act of extending a local database with new attributes from external data sources. In this paper, we study a novel problem: how to progressively crawl the deep web (i.e., a hidden database) through a keyword-search interface for data enrichment. This is challenging because these interfaces often enforce a top-k constraint, or they have limits on the number of queries that ca...

Query Planning for Searching Inter-dependent Deep-Web Databases

Increasingly, many data sources appear as online databases, hidden behind query forms, thus forming what is referred to as the deep web. It is desirable to have systems that can provide a high-level and simple interface for users to query such data sources, and can automate data retrieval from the deep web. However, such systems need to address the following challenges. First, in most cases, no...

Journal title:
  • CoRR

Volume: abs/1208.3876

Publication date: 2012